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Abstract 

A  new  approach  is  introduced  to  3-D  parameterized 
object  estimation  and  recognition.  Though  the  theory  is  appli¬ 
cable  for  any  parameterization,  we  use  a  model  for  which 
objects  are  approximated  by  patches  of  spheres,  cylinders,  and 
planes — primitive  objects.  These  primitive  surfaces  are  special 
cases  of  3-D  quadric  surfaces.  Primitive  surface  estimation  is 
treated  as  parameter  estimation  using  data  patches  in  two  or 
more  noisy  images  taken  by  calibrated  cameras  in  different 
locations  and  from  different  directions.  Included  is  the  case  of 
a  single  moving  camera.  Though  various  techniques  can  be 
used  to  implement  this  nonlinear  estimation,  we  discuss  the 
use  of  gradient  descent.  Experiments  are  run  and  discussed 
for  the  case  of  a  sphere  of  unknown  location.  It  is  shown  that 
the  estimation  procedure  can  be  viewed  geometrically  as  a 
cross  correlation  of  nonlinearly  transformed  image  patches 
in  two  or  more  images.  Approaches  to  object  surface  segmen¬ 
tation  into  primitive  object  surfaces,  and  primitive  object-type 
recognition  are  briefly  presented  and  discussed.  The  attrac¬ 
tiveness  of  the  approach  is  that  maximum  likelihood  estima¬ 
tion  and  all  the  usual  tools  of  statistical  signal  analysis  can  be 
brought  to  bear,  the  information  extraction  appears  to  be 
robust  and  computationally  reasonable,  the  concepts  are 
geometric  and  simple,  and  close  to  optimal  accuracy  should 
result. 


A.  Introduction 

Essentially  all  3-D  object  surface  estimation  from  multi¬ 
ple  views  to  date  is  based  on  either  active  stereo  using  a  laser 
and  one  or  two  cameras  for  triangulation,  or  on  passive  stereo 
involving  matching  points  in  two  images  and  using  triangula¬ 
tion,  or  on  optical  flow.  We  suggest  a  new  approach  in  which 
surfaces  of  complex  objects  are  locally  approximated  by  3-D 
parameterized  surfaces,  and  these  parameters  are  estimated 
from  two  or  more  images  taken  by  calibrated  cameras  from 
different  locations  and  directions.  Our  approach  depends  on  a 
simple  representation  for  the  transformation  of  one  image  into 
another,  based  on  knowledge  of  camera  positions  and  on  the  a 
priori  unknown  3-D  object  parameter  values.  Estimation  accu¬ 
racy  is  achieved  by  processing  data  in  blocks.  The  actual  pro¬ 
cessing  is  simple  standard  statistical  signal  analysis. 

Though  this  approach,  first  presented  in  [1],  is  completely 
new  as  far  as  we  know,  it  has  benefited  from  previous  work, 
e.g.,  work  by  Schalkoff  and  McVey  on  tracking  [2],  and  shares 
some  concepts  with  the  recent  exciting  work  of  Waxman  and 
Wohn  [3]. 


B.  Notation  and  Description  of  Camera  Motion 

Let  P  be  a  point  in  3-D  space  and  rT  =  (xy  z)^  be  its 
representation  in  the  fixed  orthogonal  world  reference  frame. 
Since  we  assume  that  objects  do  not  move,  this  reference 
frame  is  fixed  with  respect  to  the  objects  viewed  by  the  cam¬ 
era,  and  we  will  call  it  the  object  reference  frame  (ORF).  Let 
r/T  —  (x'  y'  z;  )  be  the  representation,  of  the  point  P,  in  the 
orthogonal  reference  frame  attached  to  the  camera.  This  refer¬ 
ence  frame,  the  camera  reference  frame  (CRF),  is  such  that:  (1) 
The  camera  optical  axis  is  parallel  to  the  z  axis,  and  it  looks  at 
the  negative  z  axis.  (2)  The  x  and  y  axes  are  parallel  to  the 
sides  of  the  image.  (3)  The  origin  of  the  camera  reference 
frame  coincides  with  the  center  of  the  image.  The  image  is 
corrected  so  that  the  view  is  not  inverted  top  to  bottom  and 
left  to  right,  i.e.,  a  central  projection  is  used. 

Let  B  denote  the  3X3  orthogonal  rotation  matrix  that 
specifies  the  three  unit  coordinate  vectors  for  the  CRF  in 
terms  of  the  three  unit  coordinate  vectors  for  the  ORF.  Let  rc 
specify  the  origin  of  the  CRF  in  the  ORF.  Then 

r'  =  BT(r  -  rc),  and  r  =  Br'  +  rc  (l) 

Note  that  the  rotation  matrix  B  and  the  translation  vector  rc 
are  assumed  to  be  known. 

C.  Images  of  an  Object  Surface  Point  in  Two  Image  Frames. 

Fig.  1  illustrates  the  orthographic  imaging  model  used.  P 
denotes  a  point  on  a  parameterized  3-D  surface  of  interest. 
This  surface  is  described  by  a  function  in  the  ORF.  The 
function  is  uniquely  determined  by  specifying  the  values  of  a 
parameter  vector  a.  Point  P  on  the  object  surface  is  seen  as 
points  having  coordinates  s  and  u  in  images  1  and  2,  respec¬ 
tively.  We  assume  a  Lambertian  reflectance  model  Then  the 
images  of  point  P  at  s  and  u  will  have  the  same  intensity. 
The  techniques  proposed  will  not  apply  to  specular  reflectors, 
without  modification,  because  the  location  of  points  on  the 
object  surface  at  which  specular  reflection  occurs  depends  on 
the  camera  location.  Since  most  surfaces  of  interest  are  largely 
Lambertian,  the  assumption  is  a  useful  one.  Hence 
I2(u)  =  Ij(s)  where  I^s)  and  I2(u)  are  the  picture  functions 
(image  intensity  functions)  in  Frames  1  and  2,  respectively. 

Though  our  approach  is  applicable  to  any  parameterized 
surface,  our  interest  at  present  is  primarily  in  planar,  cylindri¬ 
cal,  and  spherical  primitive  surfaces.  The  equation  of  the  gen¬ 
eral  quadric  surface  is  (2).  The  three  primitive  quadrics  of 


t  A  symbol  in  boldface  is  a  column  vector,  a  superscript  capital  T  attached  to  a 
vector  denotes  vector  transpose. 
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interest  are  obtained  by  imposing  suitable  constraints  among 
the  coefficients  in  (2). 

aux2  -4-  2a12xy  +  2a13xz  +  a29y2  +  2ao3yz  +  a33z2  -F 

(2) 

2aHx  -f  2a24y  -I-  2a34z  +  a44  =  0 

Denote  &T  —  (an>  a12i-..,  a44).  In  this  paper,  we  take  the 
ORF  to  be  the  CRF  1,  i.e.,  the  a,  vector  is  that  for  describing 
the  object  surface  in  the  first  camera  reference  frame.  Thus, 
the  image  at  point  s1  =  (x(l),  y(  1 ))  in  the  first  image  frame  is 
the  image  of  a  point  having  coordinates  x(l),  y(l),  z(l)  on  the 
object  surface.  We  see  that  the  picture  function  at  point  s  in 
CRF1  is  the  picture  function  at  point  u  in  CRF2,  where 

(u,  42))T  -  BT(r(l)  -  pc(1  ))  (3) 

Parameters  specifying  this  transformation  are  three  rotation 
angles  specifying  the  rotation  matrix  B,  and  three  translation 
components  specifying  the  location  rc(l)  of  the  origin  of  CRF2 
in  CRF1.  Denote  this  six  component  parameter  vector  by  b. 
Denote  the  functional  relationship  between  s  and  u  by  (4), 
with  h(s,  b,  z  (s,  a))  given  by  (3). 

u  =  h(s,  b,  z(s,  a))  (4) 


Note  u  depends  explicitly  on  s  and  z(l),  and  z(l)  is  determined 
by  both  s  and  a  as  shown  in  (2). 

D.  Estimation  of  a  Using  Two  Images 

If  b  is  known  and  a  is  at,  the  true  a,  then  (5)  holds  for 
each  s.  Choose  a  square  MXM  pixel  window  in  CRF1. 

•l(s)  =  WM*.  b>  z(s.  at))  (5) 

Denote  this  pixel  set  by  D.  Consider  the  error  measure  (6). 
Then  e^a)  is  a  minimum  at  a  at.  Our  problem  is  to  esti¬ 
mate  by  minimizing  (6)  with  respect  to  a. 

ec(a)  =  M'2S[Ii(s)  -  I2(h(®,  b.  z(s,  a))]2 


scD 


(6) 


An  interpretation  of  (5)  is  that  the  image  I2(u)  can  be 
transformed  into  the  image  I^s)  by  a  varying  scale  change 
that  locally  consists  of  a  rotation,  a  nonisotropic  stretching, 
and  a  translation. 

The  minimization  technique  we  have  used  is  gradient  des¬ 
cent.  The  gradient  of  (6)  is: 


deD 

5a 


=  -2M" 


EP1W-I2W] 

scD 


W 

da, 


with 


dl2(u) 

5a 


^^(u)  j  5h(s,  b,  r.(s,  a))  5z(s,  a) 


5u 


5z 


da. 


(7) 


(8) 


In  general,  it  is  inconvenient  to  express  z  as  an  explicit  func¬ 
tion  of  a.  Hence,  to  proceed  further  denote  the  left  side  of  (2) 
by  g(x,  y,  z,  a  )■  Then 

dz(s-  =  1  MMjMl/  MMjML  follows  from  (2).  Put- 
5a  5a  /  5z 

ting  the  preceding  together  results  in  (9). 

5I2(u)  ]  5h(s,  b,  z) 


-^1  =  2M-2S:il(s)  -  I2(u)] 

S€D 


(x,y,z,a)  5g(x,y,z,a) 


5u 


5z 


5a 


5z 


(9) 


Let  B13  and  B23  denote  the  first  two  elements  of  the  last 
column  of  the  3X3  matrix  BT  in  (3).  Then 
5his,  b,  z)_  j-gi3  g23yr  Hence,  computation  of  (9)  involves 

dz  .  ai2(u) 

processing  the  data  to  obtain  Ii(s),  I2(u),  — —  ,  and 

computing  (B13  B23)T  and  the  last  two  partial  derivatives  from 
the  known  camera  motion  and  knowledge  of  g(x,y,z,a). 

A  steepest  descent  algorithm  for  minimizing  (8)  is  (10). 

A  (10) 


—  am  - 


5a 


We  use  a  Am  that  depends  on  eD(am)  and 


c)eD(a„ 


ha.*;  magnitude  that  eoes  to  0  as  m  eoes  to  infinity. 


5a 


and 


E.  An  Example:  The  Sphere 


To  illustrate  the  approach,  consider  a  spherical  surface 
described  by  the  equation 

(x  -  x0)2  +  (y  -  yo)2  +  (z  -  z0)2  =  R2,  which  can  be  solved 
explicitly  for  z,  viz,  z  =  z0  ±  (R~  -  (x  -  x0)2  -  (y  -  y0)2)1^2- 
The  positive  square  root  is  used  since  the  outside  surface  of 


the  sphere  is  seen  by  the  camera  looking  in  the  negative  z 
-  5z(s,  a)  x1’  ,  5z  5z  5z  5z  1 


direction.  Hence,  (- 


L)  =(- 


5a  '  v  5x0  5y0  5z0  5R 

(  (x- x0)/(z  -  z0),  (y  -  y0)/(z  -  O,  1,  R/(z  -  zo)  )r 


) 


and  z  -  z0  =  (R2  -  (x  -  x0)2  -  (y  -  y0)2  )1/2.  The  vector 
can  be  computed  directly  from  this. 


F.  Incremental  Stereo:  Object  Estimation  from  a 
Sequence  of  Views 

Instead  of  estimating  a  from  two  frames,  assume  now 
that  there  is  a  moving  camera  with  known  motion  and  produc¬ 
ing  a  sequence  of  images  I1(*)J  I2(*),  ....  The  camera  stops 
momentarily  while  an  image  is  taken,  though  the  algorithms 
can  be  modified  in  order  that  this  assumption  not  be  neces¬ 
sary.  There  are  a  number  of  significant  advantages  in  using  a 
sequence  of  images  taken  from  successive  camera  locations  and 
directions  differing  only  by  small  amounts  from  one  to  the 
next.  One  is  that  the  processing  is  simpler,  since  the  image 
data  arrays  and  derived  arrays  in  (12)  are  all  computed  at  the 
same  s.  The  price  incurred  for  this  is  that  the  effective  signal- 
to-noise  ratio  for  estimating  at  is  then  very  small  when  using 
two  images  differing  little  from  one  another.  (The  noise 
present  is  sensor  noise,  quantization  noise,  and  other  system 
noise  and  uncertainty.)  However,  since  many  images  are  avail¬ 
able  in  this  incremental  stereo  approach,  a  great  deal  of 
“dynamic”  averaging  is  possible,  and  the  noise  can  be 
effectively  combatted. 

Let  D(n)  denote  a  rectangular  MxM  pixel  data  window  in 
the  nth  frame.  Let  e^{0)(a)  denote  the  error 
«D(n)(a)  =  M~2  E  [I„(s)  -  In+i(h(s,  c(n),  z(s,  a)))]2.  We  seek 

sc£>(n) 

the  a  that  minimizes  (11).  The  parameter  vector  c(n)  specifies 
the  transformation  from  CRFn  to  CRF(n+l),  which  may  be 
time-varying. 

N-l 

E  cD(n)(a)  (11) 

n=l 
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Since,  In(-)  and  ■)  are  assumed  not  to  differ  by  much, 

I»  =  •n  +  l(u)SS>ll«+l(s)  +  <7ln.^'Sl  [»  -  *]•  Hence- 


m-2  v 

scD(n) 


ds 


T 

[h(s,  c(n),  z(s, 


[  3I0+i(s) 

T 

dh(s,  c(n),  z(s,  a)) 

ds 

da 

where  dk(s>c(n);  2'(s>a)]  \s  a  2X3  matrix  with  column  j  being 

da 

the  partial  derivatives  of  the  two  components  of 
h(s,c(n),  z(s,a))  with  respect  to  aj.  The  beauty  of  (12)  is  that 

it  is  easier  to  compute  than  is  (7)  because  In+i(')  and  — — 

OS 

are  used  at  s  rather  than  at  the  transformed  point  u. 

Previously,  the  object  surface  was  defined  in  CRF1.  For 
this  section,  the  object  equation  must  be  transformed  to 
CRFn. 

Note  that  working  with  (12)  requires  the  storage  of  2(N- 
1)  arrays,  each  MXM,  derived  from  the  N  images.  There  are 
2  MXM  arrays  associated  with  each  pair  of  successive 
images.  This  is  a  great  deal  of  data  to  store.  A  compromise  is 
to  use  In(*)  to  compute  and  aD_^  in  (10)  and  to  then  discard 
it.  Hence,  the  only  data  used  in  computing  an+1  would  then 
be  In+1(-)  and  In(*).  If  An  goes  to  0  at  an  appropriately  slow 
rate,  it  is  possible  to  achieve  estimation  accuracy  that  is  com¬ 
parable  to  that  achievable  by  using  all  of  the  data—  Ix( 
In+i(-)—  simultaneously  as  in  (11).  This  is  a  “stochastic 
approximation”  type  of  approach  to  our  parameter  estimation 
problem  [6]. 


G.  Algorithm  Operation  Interpretation 

Fig.  2  is  useful  for  illustrating,  in  two  dimensions,  the 
operation  of  our  algorithm  for  estimating  at.  Spheres  in  3-D 
are  shown  as  circles.  Consider  the  processing  of  the  image 
patch  between  points  s'  and  s"  in  Frame  1.  This  patch  is 
the  image  of  the  patch  between  points  p'  and  p"  on  the  true 
sphere  labeled  a*,.  The  same  patch  on  the  sphere  surface  gives 
rise  to  the  image  patch  between  points  u'  and  u'7  in  Frame 
2.  Now  suppose  the  system’s  estimation  of  is  a.  The  asso¬ 
ciated  sphere  is  shown.  The  performance  functional  for  the 
estimate  of  a  is  given  by  (6)  and  is  computed  as  follows.  The 
system  thinks  that  the  locations  on  the  sphere  surface  that 
give  rise  to  the  images  at  points  s'  and  s'7  in  Frame  1  are  the 
intersections  of  the  dashed  lines,  from  s'  and  s'7  ,  with  the 
sphere  labeled  a.  These  sphere  surface  points  would  be  seen  as 
the  images  at  point  V  and  u'7  in  Frame  2.  Hence,  the 
system  takes  the  image  patch  between  points  u'  and  u'7  in 
Frame  2  and  assumes  that  the  image  at  each  point  u  in  this 
interval  is  the  same  image  as  the  image  at  a  point  s  in  the 
interval  between  s'  and  s'7  in  Frame  1.  The  points  u  and  s 
are  related  geometrically  as  in  the  figure,  or  algebraically  by 
(4).  Performance  functional  (6)  requires  computing  this  error 
I,(s)  -  I2(h(s,  b,  z(s,  a))  ). 

We  make  the  following  interesting  observations.  From 
the  geometry  of  image  formation  in  Fig.  2,  the  varying  scale 


change  that  maps  the  image  patch  over  interval  [s'  ,  s"  ]  in 
Frame  1  into  the  image  patch  over  interval  [u'  ,  u'7)  is  seen. 
Note  that  both  local  scale  change  and  translation  are  involved 
in  this  2-D  illustration. 

If  the  incorrect  a  is  used  in  computing  the  performance 
functional  (6),  the  patch  of  image  used  in  Frame  2  is  that  over 
the  interval  [u'  ,  u77  ].  Note  that  this  interval  is  both  a  shift 
and  a  varying  scaling  of  the  interval  [u'  ,  u7/  ].  If  instead  of  a 
sphere,  we  were  dealing  with  a  planar  surface,  the  scale  change 
would  be  constant  throughout  the  image. 

This  interpretation  is  further  illustrated  by  the  following 
3-D  computer  simulations.  Figs.  3a, b  are  two  frames  illustrat¬ 
ing  images  of  the  same  sphere  but  taken  at  different  locations 
and  orientations.  The  angle  between  the  optical  axes  of  the 
two  cameras  is  45°.  The  data  was  generated  by  taking  a  real 
image  with  a  T.V.  camera  and  projecting  the  image  pattern 
onto  a  sphere,  from  a  few  different  directions.  The  pattern 
projected  onto  the  sphere  in  this  way  is  the  pattern  that  is 
then  viewed  by  the  two  cameras  to  create  Frames  1  and  2. 
Note  that  all  of  these  projections  are  done  by  computer  simu¬ 
lation.  The  image  patch  used  in  Frame  1  is  the  square  shown 
there.  Point  P  is  its  center  point.  Points  P  in  Frames  1  and  2 
are  locations  on  images  of  the  same  point  on  the  3-D  sphere 
surface.  Note  how  the  image  in  the  vicinity  of  Point  P  in 
Frame  2  is  a  varying  scaled  transformation  of  the  image  in  the 
vicinity  of  Point  P  in  Frame  1.  In  Fig.  4,  we  refer  to  the  six 
image  boxes  as  I  thru  6  starting  with  the  upper  left  and  mov¬ 
ing  left  to  right  and  top  to  bottom.  Box  1  is  the  image  shown 
in  the  square  window  in  Fig.  3a.  The  system  begins  with  a 
guess  as  to  a.  This  initial  guess  is  shown  as  that  associated 
with  Box  2  in  Table  1.  Using  this  a,  for  each  point  s  in  the 
window  in  Fig.  3a  the  system  takes  the  image  at  point 
h(s,  b,  z(s,  a))  in  Fig.  3b  and  puts  this  picture  function 
value  at  location  s  in  an  array.  The  image  so  formed  is  that  in 
Box  2.  It  is  the  difference  of  the  picture  functions  of  these  two 
images  that  enters  as  Ij(s)  -  I2(h(s,  b,  z(s,  a)))  in  (6).  Box  3 
is  the  image  formed  in  this  way  using  alf  the  parameter  vector 
following  the  first  iteration  of  our  parameter  estimation  algo¬ 
rithm.  In  Boxes  4  and  5  are  the  transformed  images  associated 
with  a  values  at  intermediate  stages  of  the  estimation  process. 
Finally,  Box  6  is  the  image  associated  with  the  a  value  found 
at  the  last  estimation  stage.  These  a  values  for  the  stages 
associated  with  the  six  boxes  are  shown  in  Table  1. 

H.  Shane  of  the  Error  Function 

Figures  5  and  6  show  the  shape  of  the  error  function 
eD(&)  for  the  experiment  described  in  Sec.  G.  Fig.  5  shows  the 
graph  of  the  error  function  produced  by  holding  z0  and  R  fixed 
at  the  true  values  and  varying  x0  and  y0  over  the  ranges 
-150<x0<200  and  -I70<y0<170,  where  at  is  given  in  table 

1.  Fig.  6  shows  the  same  error  function  with  x0  and  R  held 
fixed  and  y0  and  z0  varied  over  the  ranges  -90<y0<120  and 
-120<z0<120.  Note  that  the  error  function  changes  much 
more  rapidly  with  change  in  z0  than  with  change  in  x0  or  y0, 
where  z0  is  the  distance  from  the  sphere  center  to  the  camera. 

I.  Maximum  Likelihood  Estimation  of  at 

Maximum  likelihood  or  other  standard  estimation  types 
can  be  implemented  for  a*.  By  making  and  using  assumptions 
concerning  the  reflected  light  intensity  at  the  sphere  surface,  it 
is  possible  to  get  better  estimates  of  at.  This  could  be  of  value 
when  the  image  data  is  very  noisy.  To  illustrate  these  possi- 
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bilities,  we  assume  the  pattern  is  a  fixed  but  a  priori  unknown 
function  of  r  =  (x,  y,  z)T  —  points  on  the  sphere  surface  — 
given  with  respect  to  CRF1.  Let  pt(r)  denote  this  pattern. 
Then  the  unknowns  to  be  estimated  are  at  and  /i(r). 

In(u),  the  nth  image  picture  function,  is  assumed  to  be  a 
Gaussian  r.v.  with  variance  cr2  and  mean 
E[In(u)j  =  fi( s,  z(sj  at)),  where  s  is  such  that 
u  =  h(s,b(n),  z(s,  a^)).  Here,  b(n)  specifies  the  transforma¬ 
tion  from  the  ORF  to  CRFn.  Hence,  the  joint  likelihood  of 
ID(u),  ueD(n),  given  a  and  /i(s,  z(s,  a))  is 

(2n-(r2)"M2exp  XI  |-(l/2<r2)[IQ(u)  -  /t(s,  z(s,  a))  )]2  )  (13) 

uc£>(n)  ^  ' 

The  joint  likelihood  of  In(u),  ueD(n),  n  —  1,...,N,  given  a  and 
/i(s,z(s,a))  is  just  the  product  of  the  expressions  (13),  and  is 

(2iw3rNMaexp  X  X  (-(l/2cr2)[I0(u)-p(s>  z(s,  a)))44) 

11=1  \ieE\n)  '  ' 

The  maximum  likelihood  estimate  of  at  and  /*(s,  z(s,  aj)  is 
simply  those  values  for  which  (14)  is  a  maximum.  For  a,  the 
maximum  likelihood  estimate  of  at,  the  maximum  likelihood 
estimate  of  //.( s,  z(s,  a))  is  given  by 

M 8,  z(s,  a))  =  N_1(s)  X  I„(h(s,  b(n),  ,z(s,  a))  )  (15) 

n=l 

and  such  that 
h(s,b(n),z(s,  a))cD(n) 

Note  that  N_1(s)  is  the  number  of  images  in  which  point 
(s,  z(s,  a))  on  the  sphere  surface  is  seen  and  is  used  in  the 
estimation;  jx( s,  z(s,  a))  is  the  average  of  the  picture  func¬ 
tions  at  the  points  assumed  to  be  views  of  the  location 
(s,  z(s,  a))  on  the  sphere  surface. 

The  major  difference  between  estimating  at  by  minimiz¬ 
ing  (11)  and  estimating  it  by  maximizing  (14)  is  that  in  the 
former  case  Ij(h(s,  b(j),  z(s,  a)),  j  —  n-1,  n+1  are  each  sub¬ 
tracted  from  In(h(s,  b(n),  z(s,  a))  and  squared  in  (11), 
whereas  in  the  latter  case  the  right  side  of  (15)  is  subtracted 
from  IB(h(s,  b(n),  z(s,  a))  and  squared  in  (14). 

J.  3-D  Surface  Type  Segmentation,  and  Primitive 
Object  Type  Recognition 

Assume  the  image  IQ(u)  is  a  combination  of  the  undis¬ 
torted  image  of  the  3-D  object  surface,  plus  stationary,  zero- 
mean,  white  Gaussian  noise  of  a  priori  unknown  variance  a 2. 
Further,  assume  the  noise  is  independent  from  one  image  to 
the  next.  The  noise  model  accounts  for  sensor  noise,  A/D 
quantization  noise,  and  other  random  perturbations  or  uncer¬ 
tainties  in  the  system.  Then 

w(s,  at)  =  I i(s)  -  I2(h(s,  b,  z(s,  at))  )  is  a  Gaussian  random 
variable  (r.v.)  with  mean  0  and  variance  2a2;  and  the  w(s,  at), 
seZ),  are  i.i.d.  r.v.’s,  providing 

h(s'  ,  b'  ,  z(s'  ,  at))^h(s",  b,  z(s" ,  at))  (16) 

for  s'  y£sil  and  s'  ,  stf  tD.  (If  I2(u)  is  a  compression  of  Ij(u), 
it  is  likely  that  there  will  be  pairs  s'  ,  s,r  for  which  (16)  is  not 
true.  In  such  a  case  it  is  necessary  to  purge  one  of  these  pixels 
from  D  for  the  purpose  of  the  test  to  be  discussed  now  for 
testing  whether  the  window  is  a  view  of  one  primitive  3-D 
object  surface  or  is  a  view  of  two  or  more  such  surfaces.) 
When  a^at,  w(s,  a),  stD,  are  not  i.i.d.  Furthermore,  if  two 
or  more  surfaces  are  seen  within  the  window  in  Frame  1, 


w(s,  a),  s tD,  will  not  be  i.i.d. 

Let  D1  denote  a  subset  of  D  such  that  all  pixels  in  it 
satisfy  (13),  and  let  L  be  the  number  of  pixels  in  D'  .  Then 
the  w(s,  a)  are  approximately  i.i.d  Gaussian  r.v.’s  where  a  is 
the  maximum  likelihood  estimate  of  at.  Hence,  a  test  that 
should  be  reasonably  good  for  testing  the  hypothesis  that  a 
piece  of  one  primitive  object  is  seen  in  the  image  window  D1 
against  the  alternative  that  pieces  of  two  or  more  primitive 
objects  are  seen  in  Df  or  that  a  differs  considerably  from  at,  is 
a  test  for  the  hypothesis  that  w(s,  a),  s eD*  are  i.i.d. 

Standard  t-tests  or  F-tests  can  be  used  for  this  purpose.  The 
general  quadric  surface  model  can  be  used  and  its  parameter 
vector  at  estimated,  or  a  less  careful  test  can  be  implemented, 
once  for  each  of  the  three  models  —  sphere,  cylinder,  plane. 

Finally,  asymptotically  Bayesian  recognition  is  possible 
by  using  the  type  of  asymptotic  results  discussed  in  [4]  and 
more  extensively  in  [5] . 

Conclusions 

A  parametric  modeling  and  statistical  estimation 
approach  has  been  proposed  and  some  simulation  shown  for 
estimating  3-D  object  surfaces  from  images  taken  by  calibrated 
cameras  in  two  or  more  positions.  The  parameter  estimation 
suggested  is  gradient  descent,  though  exhaustive  search  is  also 
possible.  Much  of  the  machinery  of  statistical  signal  analysis 
can  be  brought  to  bear  in  our  approach.  Processing  image 
data  in  blocks  (windows)  is  central  to  our  approach.  An 
important  question  of  interest  is  what  should  these  block  sizes 
be  in  order  to  achieve  acceptable  parameter  estimation  accu¬ 
racy.  The  required  block  size  is  a  function  of  image  noise,  3-D 
object  surface  shape,  and  pattern  reflected  at  the  object  sur¬ 
face.  These  and  other  questions  are  under  study. 
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